This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Improve bulking in Gluon #13890

Merged
3 commits merged into apache:master on Jan 30, 2019

Conversation

@ptrendx (Member) commented Jan 15, 2019

Description

This PR improves performance of hybridized Gluon models on the GPU by better bulking of ops (running GPU ops without synchronization).

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented:
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • For models hybridized with hybridize(), the engine currently creates an ImperativeBulk op that combines multiple operations into one. However, this ImperativeBulk op captures the ops in their entirety, including the synchronization at the end of each one. Since engine dependencies are only updated at the very end of the ImperativeBulk op, these per-op synchronizations are wasteful. This PR changes the RunContext structure so that the ImperativeBulk op can inform the bulked ops that it will handle synchronization itself; imperative ops check that field to determine whether they need to perform their own synchronization (see the sketch after this list).
  • For models hybridized with hybridize(static_alloc=True, static_shape=True) there is currently another bulking mechanism, similar to the one used in the symbolic API. However, it excludes ops from bulking very aggressively: any op that produces an output of the forward pass (which means most ops) cannot be bulked. This is an artificial limitation, and this PR lifts it.
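
To illustrate the idea in the first bullet, here is a minimal C++ sketch. It is not the actual MXNet code: the names `RunContextSketch`, `is_bulk`, `SomeGpuOp`, and `RunBulkSketch` are hypothetical stand-ins. The point is only the shape of the mechanism: the bulk sets a flag on the run context, each bulked op skips its own stream synchronization when the flag is set, and the bulk synchronizes once at the end.

```cpp
#include <cuda_runtime.h>
#include <functional>
#include <vector>

// Hypothetical stand-in for the engine's RunContext: the GPU stream plus a
// flag that tells an op it is running inside a bulk.
struct RunContextSketch {
  cudaStream_t stream;
  bool is_bulk;  // hypothetical field name; set by the bulk before running ops
};

// A GPU op enqueues work on ctx.stream and only synchronizes when standalone.
void SomeGpuOp(const RunContextSketch& ctx) {
  // ... launch kernels on ctx.stream ...
  if (!ctx.is_bulk) {
    cudaStreamSynchronize(ctx.stream);  // standalone: wait for our own work
  }
  // Inside a bulk: skip the sync; the bulk performs one sync at the very end.
}

// The bulk runs many ops back-to-back and pays for a single synchronization,
// since engine dependencies are only updated once the whole bulk finishes.
void RunBulkSketch(RunContextSketch ctx,
                   const std::vector<std::function<void(const RunContextSketch&)>>& ops) {
  ctx.is_bulk = true;  // tell the bulked ops not to synchronize themselves
  for (const auto& op : ops) {
    op(ctx);           // all ops enqueue onto the same stream
  }
  cudaStreamSynchronize(ctx.stream);  // one sync covers every bulked op
}
```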

Comments

  • With this PR, hybridize(static_alloc=True) and hybridize(static_alloc=True, static_shape=True) become quite similar in performance in single-GPU tests. The two main differences seem to be:
    • a difference in the time it takes to start the forward pass (not yet investigated, but the non-static-shape case seems to take longer to start)
    • the fact that ImperativeBulk is a temporary ThreadedEngineOpr destroyed at the end of its invocation, which seems to be a relatively expensive operation (a hybridized model with everything static mostly does not use ImperativeBulk and so does not pay the cost of destroying the object).
  • I will post some performance comparisons in a future comment.

@eric-haibin-lin @piiswrong

@ptrendx (Member, Author) commented Jan 15, 2019

Perf comparison:
ResNet-50 v1d from GluonCV, using a slightly modified script to enable synthetic data, on a DGX-1V 32 GB.

  • Default parameters from GluonCV script (128 per GPU, 8 GPUs, NAG optimizer)
    • Without this PR: 5700 imgs/s
    • With this PR: 5910 imgs/s
  • Small batch (32 per GPU), 1 GPU, NAG optimizer changed to SGD (at such a small batch size, the lack of a true mixed-precision version of the NAG optimizer skews results significantly):
    • Without this PR: 570 imgs/s
    • Without this PR (static_shape=False): 550 imgs/s
    • With this PR: 620 imgs/s
    • With this PR (static_shape=False): 600 imgs/s

@ptrendx (Member, Author) commented Jan 15, 2019

Also adding @KellenSunderland

@stu1130 (Contributor) commented Jan 16, 2019

Thanks for your contribution @ptrendx
@mxnet-label-bot add [pr-awaiting-review]

@marcoabreu marcoabreu added the pr-awaiting-review PR is waiting for code review label Jan 16, 2019
@ptrendx (Member, Author) commented Jan 16, 2019

The failure in test_dropout is unrelated; details in #9816.

@eric-haibin-lin (Member) left a comment

Looks good to me. @piiswrong Could you double check this PR?

@eric-haibin-lin eric-haibin-lin merged commit 89c7d57 into apache:master Jan 30, 2019
stephenrawls pushed a commit to stephenrawls/incubator-mxnet that referenced this pull request Feb 16, 2019
* Improve bulking in Gluon

* Trigger CI
vdantu pushed a commit to vdantu/incubator-mxnet that referenced this pull request Mar 31, 2019
* Improve bulking in Gluon

* Trigger CI
haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019
* Improve bulking in Gluon

* Trigger CI
Labels
pr-awaiting-review PR is waiting for code review
4 participants